[ARM] [SDPA] SVE implementation of MHASingleToken for FP32 #27273

NishantPrabhuFujitsu · 2024-10-28T11:26:20Z

Details:

Adds SVE FP32 implementations for functions called during execution of MHASingleToken for SVE-128, SVE-256 and SVE-512 platforms.
SVE implementations are compiled only if runtime support for SVE is detected on the hardware; otherwise, execution falls back to Neon.
Adds a new implementation of the exponential function exp_ps_<isa> that uses fewer FMA operations. It executes ~18% faster and has better output precision.
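
As an illustration of how one source can serve SVE-128, SVE-256 and SVE-512, here is a minimal sketch of a vector-length-agnostic loop. It is not taken from this PR: the helper name scale_accumulate_sketch and the multiply-accumulate operation are assumptions chosen for the example. The vector width is queried at run time with svcntw(), and a predicate from svwhilelt_b32 masks the tail, so no separate remainder loop is needed. On builds without SVE, the scalar branch is used instead.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_FEATURE_SVE)
#    include <arm_sve.h>
#endif

// Hypothetical helper (not from the PR): out[i] += a[i] * scale for i in [0, n).
inline void scale_accumulate_sketch(float* out, const float* a, float scale, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_FEATURE_SVE)
    // Number of 32-bit lanes per vector: 4, 8 or 16 for SVE-128/256/512.
    const std::size_t step = svcntw();
    const svfloat32_t vscale = svdup_n_f32(scale);
    for (; i < n; i += step) {
        // Predicate is true only for lanes with index < n, which handles the tail.
        const svbool_t pg = svwhilelt_b32(static_cast<uint64_t>(i), static_cast<uint64_t>(n));
        svfloat32_t vo = svld1_f32(pg, out + i);
        const svfloat32_t va = svld1_f32(pg, a + i);
        vo = svmla_f32_x(pg, vo, va, vscale);  // vo + va * vscale
        svst1_f32(pg, out + i, vo);
    }
#else
    // Scalar fallback so the sketch also builds on non-SVE targets.
    for (; i < n; ++i) {
        out[i] += a[i] * scale;
    }
#endif
}
```

The same binary code path adapts to the hardware vector length, which is why per-width variants are not required in a loop of this shape.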

Note: I am aware of the Neon FP16 implementation of SDPA added recently. To accommodate this, the current SVE changes will be used only if the hardware does not have ARM FP16 support. I will follow up with SVE FP16 implementations soon.

[SVE] Benchmarking results

Below are the benchmarking results for the execution time of each ported function. Measurements were performed by running each function individually on dummy inputs (128 fp32 elements) for 1,000,000 iterations and computing the average time (in microseconds).
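
The measurement procedure described above could look roughly like the following sketch. The harness used for the PR is not shown in the thread, so the names average_time_us and sum_sketch are hypothetical, and the summation kernel is only a stand-in for a ported function.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical harness: runs fn() `iterations` times and returns the
// average time per call in microseconds.
template <typename Fn>
double average_time_us(Fn&& fn, std::size_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> total = stop - start;
    return total.count() / static_cast<double>(iterations);
}

// Stand-in kernel (assumption): sums a buffer of fp32 values.
inline float sum_sketch(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) {
        acc += x;
    }
    return acc;
}
```

A caller would fill a std::vector<float> with 128 dummy values, write each result to a volatile variable so the compiler cannot drop the call, and pass 1,000,000 as the iteration count.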

Execution time of MHASingleToken as a whole was also measured for two LLMs, the results of which are shown below. For LlaMA-3-8B, the SVE-128 and SVE-512 systems at my disposal did not have enough memory, so only SVE-256 results are shown. While there is an improvement overall, these results could be contaminated with run-to-run variation due to the small execution time of the kernel.

Benchmarking details: A prompt length of 108 tokens was used; the total time for generating 50 tokens was measured and the average execution time was computed.

New exponential implementation

It is based on the discussion in these slides (they come from a past talk at Fujitsu, hence the document is in Japanese, sorry!). The algorithm differs slightly from the current implementation in that it uses the fexpa instruction available on ARM and requires only 3 Taylor expansion terms (2 FMA operations) to be precise to the 8th decimal place.
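
To make the idea concrete, here is a portable scalar sketch of a table-assisted exponential. It is not the PR's kernel. It assumes the lookup supplies 2^(k/64), and it uses std::exp2 as a stand-in for that lookup, since fexpa is only available on SVE hardware. The function name exp_table_sketch is hypothetical, and overflow, underflow and NaN handling are omitted.

```cpp
#include <cmath>

// Hypothetical scalar emulation; std::exp2 replaces the hardware table lookup.
inline float exp_table_sketch(float x) {
    const float inv_step = 64.0f * 1.44269504088896341f;  // 64 / ln(2)
    const float step = 0.69314718055994531f / 64.0f;      // ln(2) / 64

    // Range reduction: x = k * step + r, with |r| <= ln(2) / 128.
    const float k = std::nearbyint(x * inv_step);
    const float r = std::fma(-k, step, x);

    // Three Taylor terms of exp(r), 1 + r + r^2 / 2, evaluated with two FMAs.
    const float p = std::fma(r, 0.5f, 1.0f);   // 1 + r / 2
    const float poly = std::fma(r, p, 1.0f);   // 1 + r * (1 + r / 2)

    // Reconstruction: exp(x) = 2^(k / 64) * exp(r).
    return std::exp2(k / 64.0f) * poly;
}
```

Because |r| stays below about 0.0055, the first dropped Taylor term r^3 / 6 is on the order of 3e-8, which is why so few terms suffice once the table absorbs most of the range.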

Our benchmarking results showed this implementation to be 44%-58% faster than the existing Neon implementation. It is ~18% faster than the SVE implementation of the current algorithm in Neon.

In this PR, the new implementation is called by default. The SVE port of the existing Neon implementation has also been retained in case it is needed.

src/plugins/intel_cpu/CMakeLists.txt

.gitignore

dmitry-gorokhov · 2024-10-29T07:35:46Z

.gitmodules

@@ -87,3 +87,6 @@
 [submodule "src/plugins/intel_cpu/thirdparty/shl"]
 	path = src/plugins/intel_cpu/thirdparty/shl
 	url = https://github.com/openvinotoolkit/shl.git
+[submodule "thirdparty/src/plugins/intel_npu/thirdparty/level-zero"]


Please remove all submodule changes from the PR.

I have (tried my best to) cleaned this up. Please let me know if it's not reverted yet.

If the changes haven't been reverted, please guide me on how I can fix it. Sorry and thanks!

Submodule changes are still there. I just pushed a commit that reverts all unnecessary changes: e2d5f11. Please apply it on top of your branch.
Please don't include any submodule changes in the commits. I would recommend running the following in your working folder to get the actual state:

git submodule init
git submodule update

@dmitry-gorokhov I have applied your commit as a patch and now the changes seem to have been reverted. Please take a look and let me know.

Meanwhile, I was trying to rebase my branch onto master but I get merge conflicts like so (conflict on the whole folder?):

Could you please help?

You can use "git merge ..." instead of "git rebase ..." to pick up the latest master changes.
Or squash all 10 commits from your branch into a single one and call "git rebase ..." after that.

dmitry-gorokhov · 2024-10-29T07:37:20Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/common.hpp

@@ -246,6 +249,79 @@ static constexpr size_t vec_len_f16_neon = vec_len_neon / sizeof(ov::float16);
 #endif

 #ifdef OPENVINO_ARCH_ARM64
+#if defined(__ARM_FEATURE_SVE) && !defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)


Could you please clarify why we need !defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC) check here?

Let's use HAVE_SVE instead of __ARM_FEATURE_SVE

It was a hotfix I had added to silence some errors when initially testing my changes. It is no longer needed, so I have removed it in the latest commit.

Updated in the latest commit.

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/mha_single_token.cpp

ilya-lavrenov · 2024-10-29T12:46:57Z

src/inference/src/system_conf.cpp

+#    elif defined(__aarch64__) && defined(__APPLE__)
+    int64_t result(0);
+    size_t size = sizeof(result);
+    const std::string& cap = "hw.optional.sve";


I suppose it's just a placeholder? I don't see such HW capability on macOS

I checked for current or upcoming SVE support on Apple Metal, but couldn't find any. To ensure I don't miss it, I checked on Perplexity and this is what it suggested. I'd like your suggestion on whether it should be kept or removed.

I think it is ok to return false for Apple so far. We can consider SVE/SME support on M4 as separate activity.

Fixed to return "false" in latest commit.
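
For readers unfamiliar with this kind of check, the sketch below shows one common way to detect SVE at run time on Linux AArch64, returning false everywhere else (including Apple platforms, as agreed above). It is not the code from system_conf.cpp: the names has_sve_sketch and pick_kernel_sketch are hypothetical, and the kernel labels are plain strings used only for illustration.

```cpp
#include <string>
#if defined(__aarch64__) && defined(__linux__)
#    include <sys/auxv.h>
#    include <asm/hwcap.h>
#endif

// Hypothetical helper (not OpenVINO's implementation).
inline bool has_sve_sketch() {
#if defined(__aarch64__) && defined(__linux__) && defined(HWCAP_SVE)
    // The kernel reports CPU features through the ELF auxiliary vector.
    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
    // Other platforms, including macOS on Apple silicon: report no SVE.
    return false;
#endif
}

// Hypothetical dispatcher: returns a label for the code path that would run.
inline std::string pick_kernel_sketch() {
    return has_sve_sketch() ? "sve" : "neon";
}
```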

src/plugins/intel_cpu/CMakeLists.txt

cmake/developer_package/compile_flags/os_flags.cmake

NishantPrabhuFujitsu requested review from a team as code owners October 28, 2024 11:26

NishantPrabhuFujitsu requested review from ilya-lavrenov and removed request for a team October 28, 2024 11:26

github-actions bot added category: CPU OpenVINO CPU plugin category: dependency_changes Pull requests that update a dependency file no-match-files category: NPU OpenVINO NPU plugin labels Oct 28, 2024

sys-openvino-ci added the ExternalPR External contributor label Oct 28, 2024

dmitry-gorokhov self-assigned this Oct 28, 2024

github-actions bot added the category: build OpenVINO cmake script / infra label Oct 28, 2024

ilya-lavrenov added the platform: arm OpenVINO on ARM / ARM64 label Oct 28, 2024

ilya-lavrenov reviewed Oct 28, 2024

src/plugins/intel_cpu/CMakeLists.txt (outdated, resolved)

src/plugins/intel_cpu/CMakeLists.txt (resolved)

dmitry-gorokhov reviewed Oct 29, 2024

NishantPrabhuFujitsu requested review from a team as code owners October 29, 2024 11:12

github-actions bot added category: inference OpenVINO Runtime library - Inference category: GPU OpenVINO GPU plugin category: Python API OpenVINO Python bindings and removed no-match-files labels Oct 29, 2024

ilya-lavrenov reviewed Oct 29, 2024

src/plugins/intel_cpu/CMakeLists.txt (outdated, resolved)

ilya-lavrenov reviewed Oct 29, 2024

cmake/developer_package/compile_flags/os_flags.cmake (outdated, resolved)

github-actions bot removed category: GPU OpenVINO GPU plugin category: Python API OpenVINO Python bindings category: dependency_changes Pull requests that update a dependency file category: NPU OpenVINO NPU plugin labels Oct 30, 2024

ilya-lavrenov reviewed Oct 30, 2024

cmake/developer_package/compile_flags/os_flags.cmake (outdated, resolved)

cmake/developer_package/compile_flags/os_flags.cmake (outdated, resolved)

NishantPrabhuFujitsu force-pushed the mha-single-token-arm-sve-f32 branch from 14ca0b6 to 76d0305 Compare October 31, 2024 06:30

[CPU][ARM] Adds SVE F32 implementation for MHASingleToken sub-functions

43ad2e1

NishantPrabhuFujitsu force-pushed the mha-single-token-arm-sve-f32 branch from 76d0305 to 43ad2e1 Compare October 31, 2024 06:31
